Abstract

This paper describes the first implementation of WebWatcher, a Learning
Apprentice for the World Wide Web. We also explore the possibility of
extracting information from the structure of hypertext. We introduce an
algorithm which identifies pages that are related to a given page using
only hypertext structure. We motivate the algorithm by using the Minimum
Description Length principle.

1  Introduction 

The World Wide Web is growing quickly and reaches more and more users.
Although a lot of information is available on the World Wide Web (WWW),
it is difficult to find particular pieces of information efficiently.
Many have noted the need for software that helps the user search for
information. This paper describes the design of WebWatcher [Armstrong
et al., 1995], an agent which assists users in locating information on
the WWW or searches autonomously on their behalf. In interactive mode
WebWatcher acts as a Learning Apprentice [Mitchell et al., 1985]
[Mitchell et al., 1994]. It follows the user on his or her way through
the World Wide Web and suggests hyperlinks whenever it is confident
enough. WebWatcher learns by observing the user’s reaction to this advice
as well as by the eventual success or failure of the user’s actions. The
first implementation supports only this interactive mode. First we
describe the design and the initial implementation of WebWatcher. We then
introduce an algorithm which is used in WebWatcher to suggest pages
related to the current page.

2  WebWatcher 

This section describes the design of WebWatcher and how it assists users
in their search for information. The system can be installed easily on
any HTML page by inserting a hyperlink to the WebWatcher server. This
makes it possible to run multiple instances of WebWatcher, each an expert
on a certain part of the WWW. A user enters WebWatcher by clicking on a
hyperlink to the WebWatcher server and can specify his or her interests
by giving keywords. WebWatcher then takes the user back to the page from
which he or she entered the system. From then on WebWatcher follows the
user’s actions and suggests hyperlinks using

‘Email:  <first  name>.<last  name>@cmu.edu 


its learned knowledge. It also offers other useful functions. The user
can leave WebWatcher at any time by telling the system whether the search
was successful or not. WebWatcher offers the following functionality:

1.  highlighting  hyperlinks  on  the  current  page, 
which  WebWatcher  deems  useful  according  to  the 
user’s  stated  interests 

2.  adding  new  hyperlinks  to  the  current  page,  based 
on  the  user’s  interests 

3.  suggesting  pages  related  to  the  current  page 

4.  sending  email  messages  to  the  user  whenever 
specified  pages  change 

Figures 1 to 5 illustrate the sequence of web pages a user visits in a
typical example. Figure 1 shows an HTML page about machine learning in
which we inserted a hyperlink to WebWatcher (line 6). The user follows
this hyperlink and gets to a page which allows her to identify the type
of information she seeks. In this scenario the user is looking for a
publication and selects the category “paper”. She is presented with a
form to elaborate the information request (figure 2). The user can fill
in arbitrarily many keywords or leave fields blank. After that the user
is sent back to the page from which she entered the WebWatcher system
(figure 3).

But now WebWatcher is “looking over her shoulder” and modifies the page
in three ways. (1) WebWatcher inserts a menubar on top of the original
page. This menubar allows the user to invoke additional functions of
WebWatcher or to terminate the search. (2) WebWatcher suggests additional
hyperlinks above the menubar (figure 3, line 2). (3) WebWatcher
highlights hyperlinks in the actual page which seem interesting according
to the information seeking goal. The system highlights hyperlinks by
putting “eyes” around them (figure 3, line 13). The size of the eye icon
is a measure of WebWatcher’s confidence in the advice. In our example the
user follows WebWatcher’s advice and takes the “ILPNET” hyperlink. She
arrives at the page shown in figure 4. Until the user quits the search,
WebWatcher continues to insert the menubar into the original page and to
give advice. While WebWatcher suggests which hyperlinks the user should
take, the user remains firmly in control and may ignore the system’s
advice at any time. We think this is important because WebWatcher may
provide imperfect advice, and because WebWatcher might not perfectly
understand the user’s information seeking goal. In our scenario the user
is particularly interested in the “ILPNet” page. So she clicks on the
button “Mark this page as interesting” in the menubar. WebWatcher stores
this information and returns a list of 10 pages which WebWatcher
estimates to be closely related (figure 5). The user can leave WebWatcher
at any time by clicking on “Goodbye. Information found.” or “Goodbye. I
give up.” in the menubar.

Figure 5: Related pages

3  Machine  Learning 

The success of WebWatcher depends heavily on the quality and the quantity
of its knowledge. Because of the dynamic nature of the World Wide Web,
handcrafting and maintaining this knowledge seems difficult.
Consequently, we are exploring methods for acquiring knowledge
automatically.

The World Wide Web can be treated as a directed graph:

$$G_{WWW} = \{ \langle P, Q \rangle \mid P \text{ has a hyperlink to } Q,\ P, Q \in \mathcal{P} \}$$

$\mathcal{P}$ is the set of HTML pages, the nodes of the graph. The
hyperlinks define the edges. $G_{WWW}$ contains an edge from page $P$ to
page $Q$ whenever there is a hyperlink in $P$ pointing to page $Q$.
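As a minimal sketch of this graph view, the edge set can be held as a set of ordered pairs. The fragment below is illustrative only: the page names are borrowed from the example used later in this section, and the helper names `nodes`, `out_links` and `in_links` are assumptions made for this sketch, not part of WebWatcher.

```python
# Edges of a tiny illustrative fragment: (P, Q) means page P links to page Q.
EDGES = {
    ("Tom", "WWatcher"),
    ("Tom", "LearnLab"),
    ("MLResour", "WWatcher"),
    ("MLResour", "ILPNet"),
}


def nodes(edges):
    """All pages that appear as the source or target of some hyperlink."""
    return {page for edge in edges for page in edge}


def out_links(edges, page):
    """Pages Q such that <page, Q> is an edge of the graph."""
    return {q for p, q in edges if p == page}


def in_links(edges, page):
    """Pages P such that <P, page> is an edge of the graph."""
    return {p for p, q in edges if q == page}


print(sorted(nodes(EDGES)))
print(sorted(out_links(EDGES, "Tom")))
print(sorted(in_links(EDGES, "WWatcher")))
```

The `in_links` view is the one that matters for the rest of the paper, since a page is described by the pages that point to it.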

Two kinds of knowledge are available through the WWW graph. There is
knowledge in the nodes of the graph encoded as text. We have begun to
explore ways of using this text for guiding search [Armstrong et al.,
1995]. But the edges - the hyperlinks - also contain information. They
show relations between nodes. The remainder of this paper describes how
we can extract knowledge from the edges and how we can use this knowledge
for guiding search.


Figure  6:  Simple  example  of  WWW  structure. 


3.1  What  should  be  learned? 

Imagine we know the structure of the World Wide Web without being able to
understand the text in the nodes. What kind of information could we get
out of this? We describe below how WebWatcher implements a function to
find related web pages using only this structural information about
$G_{WWW}$. An example of the use of this functionality is given in
figures 4 and 5. In figure 4 the user is at the “Welcome to ILPNET” page,
clicks on the button “Mark this page as interesting”, and is next
presented with the screen depicted in figure 5, which suggests related
web pages.

3.2  Representation 

Web pages are purposely designed to structure the user’s search for
information. Often web pages represent collections of hyperlinks about
certain topics. An example is the “Machine Learning Resources Page” shown
in figure 1. Other pages reflect the structure of organizations or the
interests of users. Many people put hyperlinks they find interesting into
their personal home page.

Ignoring its content, one way we can describe a page in the WWW is in
terms of which other pages contain hyperlinks to it. In natural language
such a description for the “WWatcher” page in figure 6 would look like
this:

The “WWatcher” page is a page which Tom thinks is interesting, because
Tom has a hyperlink to it in his personal home page. The same is true of
Dayne and Thorsten. Furthermore the “WWatcher” page is about a project at
CMU, because it appears in the list of projects. It is about Machine
Learning, because it is referred to from the “Machine Learning Resources”
page.

Below we describe an algorithm which works under the assumption that two
pages are of similar interest if some third page points to them both.
Similar ideas have previously been explored in Information Retrieval
[Small, 1973]. The performance of query-based Information Retrieval
systems has been improved using structural information from hypertext
[Savoy, 1992].
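The assumption can be illustrated by simply counting shared referrers. This is only a sketch of the underlying intuition, not the similarity measure the paper goes on to use; the `LINKS` dictionary mirrors the link matrix given in section 3.3, and the function name `cocited_with` is hypothetical.

```python
from collections import Counter

# Each source page maps to the set of pages it links to
# (values taken from the link matrix of the Figure 6 fragment).
LINKS = {
    "Tom": {"WWatcher", "LearnLab"},
    "Dayne": {"WWatcher", "LearnLab"},
    "Thorsten": {"WWatcher", "LearnLab"},
    "Projects": {"WWatcher", "LearnLab"},
    "Katharina": {"ILPNet"},
    "MLResour": {"WWatcher", "ILPNet"},
}


def cocited_with(links, page):
    """For every other page, count the source pages that link to both."""
    counts = Counter()
    for targets in links.values():
        if page in targets:
            counts.update(targets - {page})
    return counts


print(dict(cocited_with(LINKS, "WWatcher")))
```

A raw count like this ignores how common each hyperlink is overall, which is one reason to prefer an information-based measure.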


3.3  Algorithm 

The problem we are facing is similar to the problem of Collaborative
Filtering [Resnick, 1994]. The target function we want to learn is a
mapping from an arbitrary web page to a set of related pages:

$$\mathit{Related} : \text{page} \rightarrow \{\text{related pages}\}$$

WebWatcher uses a nearest neighbor approach to approximate this target
function. To illustrate, consider the example web fragment from figure 6.
The following matrix describes the hyperlinks in this fragment of the
web.


                     hyperlink to
page         WWatcher   LearnLab   ILPNet
Tom              1          1         0
Dayne            1          1         0
Thorsten         1          1         0
Projects         1          1         0
Katharina        0          0         1
MLResour         1          0         1
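The matrix can be written down directly in code. The following sketch assumes a dictionary-of-sets encoding; the names `LINKS`, `TARGETS`, `link_matrix` and `column` are illustrative choices rather than anything taken from the WebWatcher implementation.

```python
# Each source page (a row of the matrix) maps to the set of pages it links to.
LINKS = {
    "Tom": {"WWatcher", "LearnLab"},
    "Dayne": {"WWatcher", "LearnLab"},
    "Thorsten": {"WWatcher", "LearnLab"},
    "Projects": {"WWatcher", "LearnLab"},
    "Katharina": {"ILPNet"},
    "MLResour": {"WWatcher", "ILPNet"},
}
TARGETS = ["WWatcher", "LearnLab", "ILPNet"]  # column order of the matrix


def link_matrix(links, targets):
    """Return one 0/1 row per source page, one entry per target page."""
    return {src: [int(t in out) for t in targets] for src, out in links.items()}


def column(links, target):
    """Return the 0/1 column for `target`, in the row order of `links`."""
    return [int(target in out) for out in links.values()]


matrix = link_matrix(LINKS, TARGETS)
print(matrix["Tom"])              # the row of the Tom page
print(column(LINKS, "WWatcher"))  # the column for the WWatcher page
```

Rows are the feature vectors of the source pages; columns are what gets compared when looking for related pages.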

The 1 in the row of Tom and in the column of WWatcher says that the Tom
page contains a link to the page WWatcher. Imagine we want to find pages
related to the WWatcher page. As stated above, our assumption is that
pages which are referred to from the same pages are related. This means
that we have to look at the columns of the matrix and find the ones most
similar to the WWatcher column. The pages associated with the n most
similar columns are returned by Related.

We use Mutual Information [Quinlan, 1993] as a similarity measure for
comparing columns; we discuss this choice in section 3.4. In our example
Mutual Information measures, intuitively speaking, how well the
occurrence of a particular hyperlink in a page predicts the occurrence of
another hyperlink in the same page (e.g. “How well does the occurrence of
a hyperlink to the LearnLab page predict the occurrence of a hyperlink to
the WWatcher page?”). Let $\mathcal{D}$ be a set of feature vectors. Each
element of $\mathcal{D}$ corresponds to a row of the above matrix. Each
feature vector represents a web page with associated attributes, each
attribute corresponding to the existence or absence of an outgoing
hyperlink to another particular page. We will use the naming convention
that $l_i$ stands for any hyperlink to page $P_i$. For our problem the
mutual information of hyperlink $l_i$ with respect to hyperlink $l_j$,
over the set of web pages $\mathcal{D}$, is:

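$$I(\mathcal{D}, l_i, l_j) = E(\mathcal{D}, l_i) - \frac{m_+}{m} E(\mathcal{D}_+, l_i) - \frac{m_-}{m} E(\mathcal{D}_-, l_i)$$

where $\mathcal{D}_+$ is the set of pages in $\mathcal{D}$ containing
hyperlink $l_j$, and $\mathcal{D}_-$ is the set not containing $l_j$.
$m = card(\mathcal{D})$ is the number of pages in $\mathcal{D}$,
$m_+ = card(\mathcal{D}_+)$, and $m_- = card(\mathcal{D}_-)$. $E(X, l_i)$
is the entropy function with respect to attribute $l_i$.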
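The formula can be evaluated on the example matrix with a few lines of code. This is a sketch under two assumptions: entropy is measured in bits (base-2 logarithm), and columns are passed as plain 0/1 lists. The function names are illustrative.

```python
from math import log2


def entropy(values):
    """Entropy in bits of a list of 0/1 values (0 for an empty list)."""
    if not values:
        return 0.0
    p = sum(values) / len(values)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def mutual_information(col_i, col_j):
    """Entropy of l_i minus its weighted entropy after splitting on l_j."""
    m = len(col_i)
    plus = [a for a, b in zip(col_i, col_j) if b == 1]   # pages containing l_j
    minus = [a for a, b in zip(col_i, col_j) if b == 0]  # pages without l_j
    return (entropy(col_i)
            - len(plus) / m * entropy(plus)
            - len(minus) / m * entropy(minus))


# Columns of the example matrix; rows are ordered Tom, Dayne, Thorsten,
# Projects, Katharina, MLResour.
wwatcher = [1, 1, 1, 1, 0, 1]
learnlab = [1, 1, 1, 1, 0, 0]
ilpnet = [0, 0, 0, 0, 1, 1]

print(round(mutual_information(wwatcher, learnlab), 3))
print(round(mutual_information(wwatcher, ilpnet), 3))
```

On this fragment the LearnLab and ILPNet columns obtain the same value against WWatcher (about 0.317 bits), although ILPNet mostly appears where WWatcher does not. Mutual information alone cannot tell the two cases apart.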

In order to consider hyperlinks $l_i$ and $l_j$ similar in our problem,
we require that they obey an additional condition. A binary attribute
$l_j$ can predict the occurrence of another binary attribute $l_i$ in two
ways. Either $l_i$ is true whenever $l_j$ is true, or $l_i$ is
complementary to $l_j$. To avoid regarding attributes $l_i$ and $l_j$ as
similar although they are complementary, they have to satisfy the
condition $NonComp(\mathcal{D}, l_i, l_j)$:

$$NonComp(\mathcal{D}, l_i, l_j) \Leftrightarrow \frac{p_+}{m_+} > \frac{p_-}{m_-}$$

$p_+$ is the number of pages in $\mathcal{D}_+$ which contain hyperlink
$l_i$. $p_-$ is the number of pages in $\mathcal{D}_-$ which contain
hyperlink $l_i$.
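A possible rendering of this test in code is sketched below. The handling of an empty $\mathcal{D}_+$ or $\mathcal{D}_-$ (where a ratio is undefined) is an assumption of this sketch; the text does not say how that case is treated.

```python
def non_comp(col_i, col_j):
    """True if l_i is relatively more frequent on pages with l_j than without."""
    m_plus = sum(col_j)               # pages containing l_j
    m_minus = len(col_j) - m_plus     # pages not containing l_j
    if m_plus == 0 or m_minus == 0:
        # Assumption: an undefined ratio counts as failing the condition.
        return False
    p_plus = sum(a for a, b in zip(col_i, col_j) if b == 1)
    p_minus = sum(a for a, b in zip(col_i, col_j) if b == 0)
    return p_plus / m_plus > p_minus / m_minus


# Columns of the example matrix; rows are ordered Tom, Dayne, Thorsten,
# Projects, Katharina, MLResour.
wwatcher = [1, 1, 1, 1, 0, 1]
learnlab = [1, 1, 1, 1, 0, 0]
ilpnet = [0, 0, 0, 0, 1, 1]

print(non_comp(wwatcher, learnlab))  # 4/4 > 1/2
print(non_comp(wwatcher, ilpnet))    # 1/2 > 4/4 does not hold
```

In the example fragment the condition keeps LearnLab as a candidate for WWatcher and rejects ILPNet.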

The following algorithm implements the target function Related. It
returns the n most related pages for a given page $P_{in}$:

Input: page $P_{in} \in \mathcal{D}$, output length n

•  for all hyperlinks $l_j$ which satisfy
   $NonComp(\mathcal{D}, l_{in}, l_j)$:

   •  calculate $I(\mathcal{D}, l_{in}, l_j)$

Output: the n pages to which the hyperlinks with highest mutual
information point
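Put together, the steps above can be sketched as a short program. This is an illustrative reading of the algorithm, not the WebWatcher source: the dictionary encoding of the example matrix, the base-2 entropy, the treatment of empty splits, and all function names are assumptions.

```python
from math import log2

LINKS = {  # source page -> pages it links to (the example matrix)
    "Tom": {"WWatcher", "LearnLab"},
    "Dayne": {"WWatcher", "LearnLab"},
    "Thorsten": {"WWatcher", "LearnLab"},
    "Projects": {"WWatcher", "LearnLab"},
    "Katharina": {"ILPNet"},
    "MLResour": {"WWatcher", "ILPNet"},
}


def column(links, target):
    """0/1 column of the link matrix for `target`."""
    return [int(target in out) for out in links.values()]


def entropy(values):
    """Entropy in bits of a list of 0/1 values (0 for an empty list)."""
    if not values:
        return 0.0
    p = sum(values) / len(values)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def split(col_i, col_j):
    """Values of l_i on pages with l_j and on pages without l_j."""
    plus = [a for a, b in zip(col_i, col_j) if b == 1]
    minus = [a for a, b in zip(col_i, col_j) if b == 0]
    return plus, minus


def mutual_information(col_i, col_j):
    plus, minus = split(col_i, col_j)
    m = len(col_i)
    return (entropy(col_i) - len(plus) / m * entropy(plus)
            - len(minus) / m * entropy(minus))


def non_comp(col_i, col_j):
    plus, minus = split(col_i, col_j)
    if not plus or not minus:
        return False  # assumption: undefined ratio fails the condition
    return sum(plus) / len(plus) > sum(minus) / len(minus)


def related(links, page_in, n):
    """Return up to n pages whose columns are most informative about page_in."""
    col_in = column(links, page_in)
    candidates = sorted(set().union(*links.values()) - {page_in})
    scored = []
    for target in candidates:
        col = column(links, target)
        if non_comp(col_in, col):
            scored.append((mutual_information(col_in, col), target))
    scored.sort(key=lambda pair: -pair[0])  # highest mutual information first
    return [target for _, target in scored[:n]]


print(related(LINKS, "WWatcher", 2))
```

On the example fragment only LearnLab survives the non-complementarity test for WWatcher, so fewer than n pages are returned.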

The algorithm can intuitively be extended to use multiple input pages
$P_{in_1}, \ldots, P_{in_m}$. The evaluation of a hyperlink $l_j$ is then
the sum of the mutual information of $l_j$ and each element of
$\{l_{in_1}, \ldots, l_{in_m}\}$.
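The summation can be sketched independently of how the pairwise values are computed. In the fragment below the pairwise scores are hypothetical numbers chosen for illustration, and `combined_scores` is an assumed helper name.

```python
def combined_scores(pair_score, inputs, candidates):
    """Sum pair_score(input, candidate) over all input pages, per candidate."""
    return {c: sum(pair_score(i, c) for i in inputs) for c in candidates}


# Hypothetical pairwise mutual-information values, for illustration only.
MI = {
    ("A", "X"): 0.30, ("B", "X"): 0.10,
    ("A", "Y"): 0.25, ("B", "Y"): 0.25,
}

scores = combined_scores(lambda i, c: MI[(i, c)], ["A", "B"], ["X", "Y"])
best = max(scores, key=scores.get)
print(best)
```

With these made-up values candidate X is the better match for input A alone, but Y wins once both inputs are taken into account.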

3.4  Minimum Description Length Interpretation

In this section we motivate our choice of mutual information as a
similarity function. Our argument is based on the Minimum Description
Length (MDL) principle [Rissanen, 1978]. This principle says that in a
machine learning problem one should prefer the hypothesis H which
minimizes the number of bits needed to encode the labelling of the
training examples (given hypothesis H) plus the number of bits needed to
encode the hypothesis H itself. Unlike Maximum Likelihood methods, the
MDL principle models the trade-off between training error and hypothesis
complexity. More precisely, the number of bits saved by having to encode
fewer exceptions from H is compared with the number of bits it costs to
encode a more complex hypothesis. The Minimum Description Length
principle can be derived from Bayes’ theorem (see, e.g., [Lang, 1995]).

To apply MDL to our problem, consider the target function

$$ContainsHyperlink_{l_{in}} : \text{page} \rightarrow \{0, 1\}$$

This function predicts whether a page contains hyperlink $l_{in}$. The
hypothesis language $\mathcal{L}_H$ we use for the prediction task is
very simple. It consists only of single attributes, which are all the
hyperlinks $l_i$ except $l_{in}$ itself.

$$\mathcal{L}_H = \{ l_i \mid P_i \in \mathcal{D} \} \setminus \{ l_{in} \}$$

This means that we can express hypotheses of the form “hyperlink $l_{in}$
occurs in page $\Leftrightarrow$ hyperlink $l_i$ occurs in page”.
According to the MDL principle, the hypothesis $h \in \mathcal{L}_H$
which minimizes the description length of the target values for the
training examples of $ContainsHyperlink_{l_{in}}$ given $h$, plus the
description length of $h$, should be used for approximating
$ContainsHyperlink_{l_{in}}$. This hypothesis is most likely to predict
best which pages contain a hyperlink $l_{in}$. Under the assumption that
pages which are referred to from the same pages are similar, the page
associated with the best hypothesis for $ContainsHyperlink_{l_{in}}$ has
the highest probability of being most similar to $P_{in}$ in our model.

The algorithm described in the previous section corresponds to applying
the MDL principle to $\mathcal{L}_H$, under some simplifying assumptions.
In particular, if one assumes that the prior probability distribution
over hypotheses in $\mathcal{L}_H$ is uniform, then the description
length is the same for each hypothesis. This allows ignoring the
description length of the hypothesis, and focusing only on the
description of target values given the hypothesis. Furthermore we can
maximize the reduction in description length instead of minimizing the
absolute number of bits. Mutual information is proportional to this
reduction.
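One way to see the proportionality is the following sketch. It assumes an idealized code that spends entropy-many bits per target value, which is an assumption of this illustration rather than something stated in the text. $L_{\text{before}}$ and $L_{\text{after}}$ are names introduced here for the description length of the target values without and with the hypothesis based on $l_j$.

```latex
\begin{align*}
L_{\text{before}} &= m \, E(\mathcal{D}, l_{in}) \\
L_{\text{after}}  &= m_+ \, E(\mathcal{D}_+, l_{in}) + m_- \, E(\mathcal{D}_-, l_{in}) \\
L_{\text{before}} - L_{\text{after}}
  &= m \left[ E(\mathcal{D}, l_{in})
       - \frac{m_+}{m} E(\mathcal{D}_+, l_{in})
       - \frac{m_-}{m} E(\mathcal{D}_-, l_{in}) \right] \\
  &= m \, I(\mathcal{D}, l_{in}, l_j)
\end{align*}
```

Since $m$ is the same for every candidate $l_j$, ranking hyperlinks by mutual information gives the same order as ranking them by the number of bits saved.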

The MDL analysis suggests that a better algorithm could be derived if
these simplifying assumptions were not made. For predicting how
interesting another page is, given that the user likes the current page,
a prior distribution over hypotheses in $\mathcal{L}_H$ derived from the
frequencies of hyperlinks occurring in the World Wide Web would exploit
the MDL framework more fully.

4  Results 

This paper describes work in progress. At this point we do not have final
evaluations of the effectiveness of this algorithm in suggesting
interesting related pages, although preliminary tests are encouraging.
Nevertheless it is possible for the reader to try the algorithm as part
of WebWatcher. The system is accessible to the public from the WebWatcher
Home Page at URL
http://www.cs.cmu.edu/afs/cs/project/theo-6/web-agent/www/project-home.html.

5  Conclusion 

WebWatcher is a flexible interface which allows learning from users’
actions and assists them interactively while they are browsing the World
Wide Web. We have shown here an algorithm that extracts and uses
information stored in the structure of hypertext, without considering the
text itself. We conjecture that the use of structural information can
improve text-based methods in many cases.

6  Future  Work 

Our next goal is to find adequate measures for testing the performance of
the algorithm. We also plan to explore how to use other structural
information such as outgoing links of a page. Furthermore, we plan to
apply the algorithm to problems of collaborative filtering.